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Improvement of algorithms for dynamical overlap fermions 



1. Introduction 

The JLQCD Collaboration is performing dynamical QCD simulations with the overlap fermions, 
as a new project started in 2006 [Jl], ^, ||, |J]. At present, our main run is generating lattices of size 
16 3 x 32, a ~ 0.12 fm, with two flavors of sea quarks whose smallest mass ~ m s /6. The topolog- 
ical sector is fixed by a pair of extra Wilson fermions. This considerably improves the efficiency 
of the HMC algorithm, while the effect of fixing the topological charge Q should be examined by 
measuring on configurations with different values of Q. Numerical simulations at different values 
of Q, as well as with larger lattices and with 2+1 flavors, are also in preparation. 

These studies are being carried out on a new supercomputer system at KEK, which is in service 
since March 2006 [Q]. The system has two computational servers: Hitachi SRI 1000 model Kl 
(peak performance 2. 15TFlops), and IBM System Blue Gene Solution (57.3TFlops). The latter 
system has 10 racks, each composed of 1024 nodes (2048 processor cores). For the Wilson fermion 
solver, with data on cache, the Blue Gene system achieves about 29% of the peak performance on 
a half -rack system. 1 

Even on these powerful machines, the dynamical overlap simulation is quite challenging. The 
simulation is performed with the Hybrid Monte Carlo (HMC) algorithm. In this paper, we describe 
our attempt to speed-up the simulation by improving the HMC algorithms. 



2. Action 



The effective action we treat in the HMC simulation has a form, 

S = S G + S F + S E . (2.1) 

So is the gauge field part, for which we adopt the renormalization group improved (Iwasaki) action, 
while in an exploratory stage of this work a modified plaquette gauge action, which is designed to 
suppress dislocations, was also investigated 

As the quark action, we use the Nf = 2 overlap fermion action. The overlap-Dirac operators 
is represented as 

D(m) = mo + - + m - - y 5 sign(# w ), (2.2) 



2 

where Hw = YsDw, Dy? is the Wilson-Dirac operator with a large negative mass —mo. The sign 
function in Eq. (2.2) is approximated by a partial fraction form [[J ||]: 

sign(# w ) = ^%^H W (po + t -HT 1 — J • (2-3) 

The N inversions (H^ + qi) 1 are calculated at the same time using the multi-mass conjugate 
gradient method. 

We utilize the mass preconditioning [g], i.e. introducing a preconditioning term with heavier 
quark mass m' than that of the dynamical quark. The fermion action becomes Sp = Sppi +Spp2, 

S PF1 = <$>l[D{m'fD(m'))- x <$> u S PF2 = $ {D(m')[D(mf ' D(m)]- l D(m ! )^} 2 , (2.4) 



We thank J. Doi and H. Samukawa of IBM Japan for tuning the Wilson kernel on the Blue Gene system. 
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Figure 1: The low-lying modes of Hw in the case with Se (p. — 0.2, left panel) and without Se (right panel) 
at a ~ 0.125 fm and m ~ 2m s . 



where 5pf i is the preconditioner and Spf 2 is the preconditioned dynamical quark term with corre- 
sponding pseudofermion fields ty\ and (j) 2 , respectively. 
Se is the extra Wilson fermion term defined as 

r2 



det {j^rf-i) = J &X*9X™p[-Se], (2-5) 



X, (2-6) 



(D w + it%/x)(D^D w ) \D w + iy 5 ^ 

where the denominator of Eq. ( ^B] ) amounts to two flavors of heavy ghosts with a twisted mass ju. 
This term suppresses near-zero modes of Hw, while keeping the effects on higher modes minimal 



[g, g, [X0|, [llj]. The newly introduced fields have mass of the order of lattice cutoff and therefore 
irrelevant in the continuum limit. 

The quark action becomes singular when Hy? has a zero eigenvalue. This causes discontinuity 
in the conjugate momenta when A„„„ changes the sign during the molecular dynamics evolution. 
While this problem can be circumvented by the so-called reflection/refraction prescription [[T3|] , 
it requires monitoring of the near-zero eigenvalues and additional inversions of overlap operator, 
which largely increase the numerical cost. In our case, however, extra Wilson fermions prevent the 
near-zero mode from approaching zero, and hence these operations can be skipped. 

How the extra Wilson fermions work is depicted in Figure [jj The figure compares appearance 
of low-lying eigenmodes during the HMC runs with and without Se at a ~ 0.125 fm and m ~ m s . 
It is clear that Se suppresses the spectral density around A = 0. The same feature is found in the 
molecular dynamics history of the lowest mode displayed in Figure ||| With Se, no reflection nor 
refraction occurs, contrary to the case without Se- One can therefore switch off the monitoring of 
A in the case with Se- Even when A = occurs due to a finite molecular dynamics step size, it is 
signaled by large AH and thus rejected in the Metropolis test. 



3. Solver algorithms 

Since the inversion of the overlap-Dirac operator is the most time consuming part of the HMC 
simulation, improvement of the solver algorithm is crucial. We compare two methods: the nested 
CG with relaxed precision of the inner CG loop, and the 5-dimensional CG algorithm. 
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Figure 2: Evolution of A ml „ in the molecular dynamics at a ~ 0.12 fm and m ~ 2m s . The left and right 
panels show the cases with and without Se, respectively. The left panel shows an event that as if a reflection 
occurs, while is actually not the case. 



The overlap operator requires computation of the partial fraction terms in Eq. (|2.3|). Therefore, 
the CG method to invert the overlap operator has a nested structure; the inner loop to calculate 
(H w + qi)~ l , and the outer loop to operate D(m). For the inner loop, the multi-shift CG method is 



used to solve (H w -\-qi)~ l for all / simultaneously. The precision of the approximation Eq. (2.3) is 
determined by the degree N and the condition number A max /A m! „. For a smaller |A mi „|, a larger N 
is needed to keep the precision; e.g. N=10 corresponds to 0(1O -7 ) accuracy for |A mi „|=0.05 and 
O(10~ 5 ) for 0.01. The multi-shift CG method has an advantage that the cost is almost independent 
of ,/V. Instead of extending the window [|A m ,„|, (Am^l] for small |A m ,„|, we may project out the low- 
lying modes explicitly and add back with the eigenvalue sign(A). In this way we may fix the lower 
limit of the approximation to some threshold X t h rs , below which the eigenmodes are treated exactly. 
The relaxed CG method is an improvement of the nested CG method. It changes the precision 



of the inner loop adaptively as the outer loop iteration proceeds [14]. As we will see, the relaxed 
CG is about twice faster than the original CG. 



An alternative solver is the 5-dimensional CG method Q15[]. Let us consider the following form 
of a 5D matrix (an explicit example for the N = 2 case): 



Ms 



( H w —^Jqi 
-\ft]2 -Hw 



Hw 



'<1\ 
-Hw 



\ ^pi 









~P~2 



P\ 



\ 




(3.1) 



Rj5+P()Hw J 



Each component represents the usual 4D matrix. By the Schur decomposition, 



M 5 = LDU 




where 



S = D - CA l B = Ry 5 + PqH w + H w Y -± 

i H w 



(3.2) 



(3.3) 
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Figure 3: Comparison of the solver algorithms on a single configuration. The left panel shows convergence 
of the residue as the number of Dw multiplication. The right panel shows quark mass dependence of the 
number of Dw multiplication needed for |r| 2 /|Z?| 2 < 10~ 14 . 



expresses the partial fraction approximation of D(m). Therefore, by solving 




(3-4) 



y/4 = S %4 is determined. A preconditioning is applied by multiplying the inverse of = 
M$ [U = 0] , which is easily inverted by forward and backward substitutions. The even-odd precon- 
ditioning is also applicable, and according to our performance comparison, this is the best solution 
for the 5D solver. Since the size of M5 grows as N, the numerical cost increases linearly in N. 
A disadvantage is that the subtraction of low-modes of Hw is not applicable when the even-odd 
decomposition is used. This causes a difficulty when A m ,„ becomes too small to be approximated 
without the projection. To apply the 5D solver, one needs to determine the lowest boundary V m ;„ 
above which the partial fraction approximation is valid. 

Figure || shows the comparison of the solver algorithms at a ~ 0.12 fm, m ~ 0.4m s , on a single 
configuration. The figure shows that the relaxed CG is factor of 2 faster than the standard CG 
method. The 5D solver is even faster by another factor of 2-3 than the relaxed CG for N = 20. This 
conclusion is independent of the quark mass, as displayed in the right panel of Fig. |3[ Therefore, if 
near-zero eigenvalues do not appear, as in the present case, the 5D solver is the fastest. 



4. HMC algorithm 

Multi-time step. Magnitude of the forces corresponding to the terms, So, Spfi, Spf2, and Se has 
a hierarchical structure. In particular the gauge part has the largest contribution to the evolution of 
the conjugate momenta, while the cost to compute it is negligible compared to the fermionic part. 



The size of the force for Spp2 is smaller compared to that of Spei- The multi-time step [|12J makes 
use of this hierarchy by adopting different time steps for these terms in the molecular dynamical 
evolution. 

The forces are compared in Fig. |j. This result suggest to chose the step sizes as 

AT (i » F2 ) > AT (m) > At (G ) = At (£ ) . (4. 1) 
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Figure 4: The maximum values of the forces (left panel) and the costs of the forces (right panel) monitored 
in HMC at a ~ 0. 12 fm and m ~ 2m s . 



While the size of the force for Se is as small as the fermionic part, Ai( E ) is set to be the same as the 
gauge part, to ensure the disappearance of the near-zero modes, because the fluctuation of the Se 
force is large. The cost to determine the force of the extra Wilson fermions is negligible compared 
to the overlap fermion part. The ratio of the step sizes are determined by monitoring the size of the 
forces. For example, Ax^pf2)/^(pfi) = 5 and Ax/ PF1 \/AXf GE \ = 6 are a reasonable choice for the 
displayed case. 

Noisy Metropolis test. Considering the performance of the solvers in Sec. [| the 5D CG method 
is preferable, with small number of poles N if possible. As for the preconditioner, we can choose 
relatively small N, since the contributions to the dynamics cancel in Sppi and Sp F 2- For Spp2, one 
can also choose an N with a less precise approximation by making use of the noisy Metropolis 



algorithm Jl6|], which is prescribed as follows. At the end of a molecular dynamics evolution, after 
the usual Metropolis test, we accept U new with a probability P = rnin{l,e _<is }, where 



ds = \w- 1 [u new ]w[u old }^\ 2 -\^\ 



(4.2) 



W = D(m)/D' (m), with D' a less accurate overlap operator used in HMC, and D the accurate 
overlap operator, U u is the initial gauge field, and E, is a random Gaussian noise vector. 

Performance. Finally, we show the present performance of HMC measured on the Blue Gene 
(512-node) at a ~ 0.12 fm, /I = 0.2, and a trajectory length x = 0.5. The first three lines in Table |l] 
show the result for the simulation with the 4D (relaxed CG) solver, with which most of gauge 
configurations are generated so far. No noisy Metropolis test is incorporated. The last three lines 
in Table |j] show a preliminary result for the performance with fully improved algorithm, the less 
precise 5D solver in molecular dynamics with the noisy Metropolis test, which achieves about a 
factor of 3 acceleration. Therefore this algorithm is our current best option, which will be adopted 
in our productive run in future. 

Numerical simulations are performed on Hitachi SRI 1000 and IBM Blue Gene at High Energy 
Accelerator Research Organization (KEK) under a support of its Large Scale Simulation Program 
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Nested CG 


0.015 


0.2 
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10 


0.87 


112 
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0.025 


0.2 
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10 


0.90 
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6 
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10 
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5D solver 


0.035 
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10 


0.68 


22 
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0.80 


26 
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6 


6 


10 


0.78 


23 



Table 1: Performance of HMC on Blue Gene (512-node) at a ~ 0.12 fm, pt = 0.2. Step-1 means the 
simulation with (4D) nested CG overlap solver, and Step-2 with the 5D overlap solver corrected by the noisy 
Metropolis test. 
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